Common language processing tools

JRC Names annotator

JRC Names is a highly multilingual named entity resource covering person and organisation names (the 'entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including variants across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.). The JRC Names resources are updated daily. The JRC Names annotator is a UIMA wrapper for these resources. Because the resource is multilingual, the annotator is suitable for inclusion in each of the language processing chains (LPCs) in the project.

Paragraph splitter

The paragraph splitter is based on the regular expression “((^.*\S+.*$)+)”. More information can be found in the code of the com.tetracom.uima.text.ParagraphSplitter class.
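To make the behaviour of this expression concrete, the following is a minimal sketch in plain Java, not the actual ParagraphSplitter code. It assumes the expression is compiled with the MULTILINE flag (the source does not say which flags are used), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphSplitSketch {
    // The expression quoted in the text. MULTILINE is an assumption: it makes
    // ^ and $ anchor at line boundaries instead of only at the input ends.
    private static final Pattern PARAGRAPH =
            Pattern.compile("((^.*\\S+.*$)+)", Pattern.MULTILINE);

    /** Returns the [begin, end) character offsets of every match in the text. */
    public static List<int[]> split(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\n   \nSecond paragraph.";
        for (int[] span : split(text)) {
            System.out.println(span[0] + "-" + span[1] + ": "
                    + text.substring(span[0], span[1]));
        }
    }
}
```

Under these assumptions, every line that contains at least one non-whitespace character yields one span, while empty and whitespace-only lines are skipped. In a UIMA annotator the offsets would typically become the begin and end of a paragraph annotation.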

URL and Email annotator

This tool is based on regular expressions. URLs and email addresses contain dots (“.”), which confuse the subsequent components. URLs and email addresses found in the text are therefore annotated as named entities and skipped by the other annotators in the chain.
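The idea can be illustrated with a small sketch. The patterns below are deliberately simplified assumptions, not the expressions used by the actual annotator, and the class, type and method names are hypothetical. The sketch only reports the spans that a later component would need to skip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlEmailSketch {
    /** A matched span: its kind ("URL" or "EMAIL") and character offsets. */
    public static final class Span {
        public final String kind;
        public final int begin;
        public final int end;

        Span(String kind, int begin, int end) {
            this.kind = kind;
            this.begin = begin;
            this.end = end;
        }
    }

    // Simplified, illustrative patterns (assumptions). The URL branch refuses
    // to end on sentence punctuation so a trailing full stop is left outside.
    private static final Pattern URL_OR_EMAIL = Pattern.compile(
            "(?<url>https?://\\S*[^\\s.,;:!?)])"
            + "|(?<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,})");

    /** Finds URL and email spans in document order. */
    public static List<Span> annotate(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher m = URL_OR_EMAIL.matcher(text);
        while (m.find()) {
            String kind = m.group("url") != null ? "URL" : "EMAIL";
            spans.add(new Span(kind, m.start(), m.end()));
        }
        return spans;
    }
}
```

A downstream annotator could then ignore any dot whose offset falls inside one of the returned spans, which is one plausible way to implement the skipping described above.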

ParseEst

The BG tokenizer, BG NP extractor and BG NE recogniser use ParseEst, a generic tool for crafting, compiling and applying linguistic rules. The tool can also be used for other languages and for other tasks that involve developing language grammars. The rules are formulated in the XML-based ParseEst formalism. ParseEst consists of two main modules, lr_builder and lr_engine:

  1. ParseEst lr_builder - a tool for compilation of linguistic rules as a finite state transducer.
  2. ParseEst lr_engine - a tool for the application of linguistic rules over an annotated text.

Lexicon compiler

The BG sentence splitter and BG NE recogniser use the Lexicon compiler, a generic and relatively language-independent tool for compiling large lexicons, which can be used for different purposes. A lexicon is a list of words collected according to certain criteria, e.g. family names, proper names, etc. The lexicon definition file specifies the available lexicons and defines union operations among them. All lexicons are compiled as a finite state automaton.
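The union-and-compile idea described for the Lexicon compiler can be sketched as follows. This is not the tool's implementation or its definition file format: it simply takes the union of several word lists and stores it in a trie, which is a deterministic acyclic finite state automaton (without the minimisation a production compiler would likely apply). All names and the sample words are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LexiconSketch {
    /** One automaton state: outgoing transitions plus an accepting flag. */
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accepting;
    }

    private final State start = new State();

    /** Takes the union of the given lexicons and compiles it into a trie. */
    @SafeVarargs
    public static LexiconSketch compileUnion(List<String>... lexicons) {
        Set<String> union = new LinkedHashSet<>();
        for (List<String> lexicon : lexicons) {
            union.addAll(lexicon);
        }
        LexiconSketch compiled = new LexiconSketch();
        for (String word : union) {
            compiled.add(word);
        }
        return compiled;
    }

    // Adds one word by following or creating a transition per character.
    private void add(String word) {
        State state = start;
        for (char c : word.toCharArray()) {
            state = state.next.computeIfAbsent(c, k -> new State());
        }
        state.accepting = true;
    }

    /** Returns true if the automaton accepts exactly this word. */
    public boolean contains(String word) {
        State state = start;
        for (char c : word.toCharArray()) {
            state = state.next.get(c);
            if (state == null) {
                return false;
            }
        }
        return state.accepting;
    }
}
```

For example, compiling the union of a family-name list and a place-name list gives a single automaton that answers membership queries in time proportional to the length of the word, regardless of the lexicon size.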

Performance report

The last component in each language processing chain is the “Performance report” provider. The annotated text is enriched with a performance report containing the overall processing time (in milliseconds) for the whole document, as well as the processing time for each primitive engine (annotator in the chain). Each performance report is then stored in a database so that the LPC can be evaluated in terms of productivity.
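The timing part of such a report can be sketched as below. This is an illustration under stated assumptions, not the project's component: engines are modelled as plain Runnable steps rather than UIMA analysis engines, the class and method names are hypothetical, and database storage is omitted.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class PerformanceReportSketch {
    private final Map<String, Long> engineMillis = new LinkedHashMap<>();
    private long totalMillis;

    /** Runs the engines in order, timing each one and the chain as a whole. */
    public static PerformanceReportSketch run(Map<String, Runnable> engines) {
        PerformanceReportSketch report = new PerformanceReportSketch();
        long chainStart = System.nanoTime();
        for (Map.Entry<String, Runnable> engine : engines.entrySet()) {
            long start = System.nanoTime();
            engine.getValue().run();
            long elapsedNanos = System.nanoTime() - start;
            report.engineMillis.put(engine.getKey(), elapsedNanos / 1_000_000);
        }
        report.totalMillis = (System.nanoTime() - chainStart) / 1_000_000;
        return report;
    }

    /** Processing time per engine, in milliseconds, in execution order. */
    public Map<String, Long> getEngineMillis() {
        return Collections.unmodifiableMap(engineMillis);
    }

    /** Overall processing time for the whole document, in milliseconds. */
    public long getTotalMillis() {
        return totalMillis;
    }
}
```

Passing a LinkedHashMap keeps the engines in chain order, so the report lists them in the order they ran. The overall time is measured separately from the per-engine times, so it also includes any overhead between engines.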